{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WGH9m3aoAYXp"
   },
   "source": [
    "# Q&A with finetuned BERT \n",
    "\n",
    "In this section, let's learn how to perform question answering with a finetuned Q&A BERT. First, let us import the necessary modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true,
    "id": "mu4z4uVSAZoX"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install transformers==3.5.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "id": "0hpJg1FbAYXy"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from transformers import BertForQuestionAnswering, BertTokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lPJaYZzmAYX0"
   },
   "source": [
    "\n",
    "Now, we download and load the model. We use the bert-large-uncased-whole-word-masking-finetuned-squad model which is finetuned on the SQUAD (Stanford question answering dataset). \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "id": "U4rF-fcHAYX2"
   },
   "outputs": [],
   "source": [
    "model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4VIr9V1qAYX2"
   },
   "source": [
    "\n",
    "Next, we download and load the tokenizer which is used for pretraining the bert-large-uncased-whole-word-masking-finetuned-squad model: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "id": "gI6MVxD7AYX3"
   },
   "outputs": [],
   "source": [
    "tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jtIJnwohAYX3"
   },
   "source": [
    "\n",
    "Now that we downloaded the model and tokenizer, let's preprocess the input. \n",
    "\n",
    "## Preprocessing the input\n",
    "First, we define the input to the BERT which is question and paragraph text:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true,
    "id": "c_M3V_kVAYX3"
   },
   "outputs": [],
   "source": [
    "question = \"What is the immune system?\"\n",
    "paragraph = \"The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism's own healthy tissue.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3w6MAGh_AYX4"
   },
   "source": [
    "Add [CLS] token to the beginning of the question and [SEP] token at the end of both the question and paragraph: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true,
    "id": "hsn5nIiBAYX4"
   },
   "outputs": [],
   "source": [
    "question = '[CLS] ' + question + '[SEP]'\n",
    "paragraph = paragraph + '[SEP]'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "h8HkGXVSAYX4"
   },
   "source": [
    "\n",
    "Now, tokenize the question and paragraph: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true,
    "id": "JpOn-vyHAYX4"
   },
   "outputs": [],
   "source": [
    "question_tokens = tokenizer.tokenize(question)\n",
    "paragraph_tokens = tokenizer.tokenize(paragraph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5QVFGvnZAYX5"
   },
   "source": [
    "\n",
    "\n",
    "Combine the question and paragraph tokens and convert them to input_ids:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true,
    "id": "y5Eo9QhAAYX5"
   },
   "outputs": [],
   "source": [
    "tokens = question_tokens + paragraph_tokens \n",
    "input_ids = tokenizer.convert_tokens_to_ids(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "55nyUd8CAYX5"
   },
   "source": [
    "\n",
    "\n",
    "Next, we define the segment_ids. The segment_ids will be 0 for all the tokens of question and it will be 1 for all the tokens of the paragraph:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true,
    "id": "bhqMUQxJAYX5"
   },
   "outputs": [],
   "source": [
    "segment_ids = [0] * len(question_tokens)\n",
    "segment_ids += [1] * len(paragraph_tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mssGDCivAYX6"
   },
   "source": [
    "\n",
    "Now we convert the input_ids and segment_ids to tensor: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true,
    "id": "RFTYxYotAYX6"
   },
   "outputs": [],
   "source": [
    "input_ids = torch.tensor([input_ids])\n",
    "segment_ids = torch.tensor([segment_ids])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "e48VMzbCAYX6"
   },
   "source": [
    "\n",
    "\n",
    "Now that we processed the input. Let's feed them to the model and get the result. \n",
    "\n",
    "## Getting the answer\n",
    "We feed the input_ids and segment_ids to the model which return the start score and end score for all of the tokens: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true,
    "id": "-dBxzkQsAYX6"
   },
   "outputs": [],
   "source": [
    "start_scores, end_scores = model(input_ids, token_type_ids = segment_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d5zRrjE8AYX6"
   },
   "source": [
    "\n",
    "Now, we select the start_index which is the index of the token which has a maximum start score and end_index which is the index of the token which has a maximum end score: \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true,
    "id": "9AUj2-jMAYX7"
   },
   "outputs": [],
   "source": [
    "start_index = torch.argmax(start_scores)\n",
    "end_index = torch.argmax(end_scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rdePQpGIAYX7"
   },
   "source": [
    "\n",
    "That's it! Now, we print the text span between the start and end index as our answer: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "Z8pNAGoYAYX7",
    "outputId": "b58dce9d-5a87-48dc-c7ba-60b62112e304"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a system of many biological structures and processes within an organism that protects against disease\n"
     ]
    }
   ],
   "source": [
    "print(' '.join(tokens[start_index:end_index+1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JLMafolhAYX7"
   },
   "source": [
    "\n",
    "Now that we learned how to finetune BERT for the question answering task, in the next section, we learn how to finetune BERT for the named-entity recognition task. \n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "3.09. Q&A with finetuned BERT .ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
